4-5/5/2021

1. Data Analysis

Data Analysis in the Scientific Cycle

Data-Intensive Research

  • Science and humanities are increasingly data-driven
    • Early-career training has not prepared all researchers for this

Research Workflows

  • Enable systematic, replicable and reproducible work
    • Design principles
      • Best practices for data
    • Software development methods
      • Automation of repetitive calculations

Pipelines and Workflows

Pipeline

  • What a computer does
    • A series of instructions
    • Data is piped through programs, and a result emerges

Workflow

  • What a researcher does
    • Exploring data, developing hypotheses, writing code, interpreting results
  • Outputs include:
    • datasets, methods, teaching materials, software, papers, etc.

Explore, Refine, Produce (ERP)

2. Welcome to R

Learning Objectives

  • Fundamentals of R and RStudio
  • Fundamentals of programming (in R)
  • Data management with the tidyverse
  • Publication-quality data visualisation with ggplot2
  • Reporting with RMarkdown

What is R?

  • R is:
    • a programming language
    • the software that interprets/runs programs written in the R language

Why use R?

  • free (though commercial support can be bought)
  • widely used
    • sciences, humanities, engineering, statistics, etc.
  • has many excellent specialised packages for data analysis and visualisation
  • international, friendly user community

What is RStudio?

Please start RStudio

  • RStudio is an integrated development environment (IDE)
  • Script/code editor; Project management
  • Interaction with R (console/‘scratchpad’); Graphics/visualisation/Help

“Why not use Excel?”

  • Excel is good for some things
  • R is excellent for analysis and reproducibility…
  • Separates data from analysis
  • Not point-and-click: every step is explicit and transparent
  • Easy to share, adapt, reuse, publish analyses with new/modified data (GitHub)
  • R can be run on supercomputers, with extremely large datasets…

RStudio overview - INTERACTIVE DEMO

Variables

Variables are like named boxes

  • An item (object) of data goes in the box (which is called Name)
  • When we refer to the box (variable) by its name, we really mean what’s in the box

Variables - Interactive Demo

x <- 1 / 40
x
## [1] 0.025
x ^ 2
## [1] 0.000625
log(x)
## [1] -3.688879
name <- "Samia"
name
## [1] "Samia"

Naming Variables

Variable names are documentation

current_temperature <- 28.6
subjectID <- "GCF_00001236452.1"
GPS_Location <- "54N, 36E"
  • descriptive, but not too long
  • letters, numbers, underscores, and periods ([a-zA-Z0-9_.])
  • cannot contain whitespace or start with a number (x2 is allowed, 2x is not)
  • case sensitive (Weight is not the same as weight)
  • do not reuse names of built-in functions
  • Consistent style:
    • lower_snake, UPPER_SNAKE, lowerCamelCase, UpperCamelCase
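
A minimal sketch of two of these rules in practice. The variable names and values are illustrative only. make.names() is a base R function that converts arbitrary strings into syntactically valid names, so it is used here to show which names R would not accept as written.

```r
# Names are case sensitive: these are two different variables
weight <- 70.5
Weight <- 82.1
weight == Weight
## [1] FALSE

# make.names() repairs invalid names: "2x" starts with a number,
# and "body mass" contains whitespace
make.names(c("x2", "2x", "body mass"))
## [1] "x2"        "X2x"       "body.mass"
```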

Functions

Functions (log(), sin() etc.) ≈ “canned script”

  • automate complicated tasks
  • make code more readable and reusable
  • Functions usually take arguments (input)
  • Functions often return values (output)
  • Some functions are built-in (in base packages, e.g. sqrt(), lm(), plot())
  • Groups of related functions are distributed as packages, and loaded with library()
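
The points above can be sketched as follows. celsius_to_kelvin() is a made-up function for illustration and is not part of R or of the course material. The stats package is used only because it ships with every R installation.

```r
# Calling a built-in function: one argument in, one value out
sqrt(16)
## [1] 4

# Arguments can be given by name; 'base' changes what log() computes
log(100, base = 10)
## [1] 2

# Writing our own function (hypothetical example)
celsius_to_kelvin <- function(temp_c) {
  temp_c + 273.15
}
celsius_to_kelvin(25)
## [1] 298.15

# Loading a package makes its functions available
library(stats)   # 'stats' is installed with R and provides lm(), sd(), ...
```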

Getting Help in R

INTERACTIVE DEMO

args(fname)            # arguments for fname
?fname                 # help page for fname
help(fname)            # help page for fname
??fname                # any mention of fname
help.search("text")    # any mention of "text"
vignette(fname)        # worked examples for fname
vignette()             # show all available vignettes

Challenge 01 (1min)

What will be the value of each variable after each statement in the following program?

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
  • mass = 47.5, age = 102
  • mass = 109.25, age = 102
  • mass = 47.5, age = 122
  • mass = 109.25, age = 122

USE CHALLENGE LINK ON ETHERPAD

3. Project Management in R

How Projects Tend To Grow

Good Practice

THERE IS NO ONE TRUE WAY (only principles)

  • Use a single working directory per project/analysis
    • easier to move, share, and find files
    • use relative paths to locate files
  • Treat raw data as read-only
    • keep in a separate subfolder (data?)
  • Clean data ready for work programmatically
    • keep cleaned/modified data in separate folder (clean_data?)
  • Consider output generated by analysis to be disposable
    • can be regenerated by running analysis/code
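
One possible way to set up such a layout from R, as a sketch only. The folder names (data, clean_data, results) and the project name my_project are assumptions, and tempdir() stands in for wherever a real project would live.

```r
# Create a project folder with separate subfolders (names are assumptions)
project_dir <- file.path(tempdir(), "my_project")
for (sub in c("data", "clean_data", "results")) {
  dir.create(file.path(project_dir, sub), recursive = TRUE, showWarnings = FALSE)
}
list.files(project_dir)
## [1] "clean_data" "data"       "results"

# Build paths relative to the project root instead of typing absolute paths
raw_file <- file.path("data", "inflammation-01.csv")
raw_file
## [1] "data/inflammation-01.csv"
```

Relative paths like raw_file keep working when the whole project folder is moved or shared.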

Example Directory Structure

Project Management in RStudio

  • RStudio tries to help you manage your projects
    • R Project concept - files and subdirectory structure
    • integration with version control
    • switching between multiple projects within RStudio
    • stores project history

Let’s create a project in RStudio

INTERACTIVE DEMO

Working in RStudio

We can write code in several ways in RStudio

  • At the console (you’ve done this)
  • In a script
  • As an interactive notebook
  • As a markdown file
  • As a Shiny app

We’re going to create a new dataset and R script.

  • Putting code in a script makes it easier to modify, share and run

INTERACTIVE DEMO

4. A First Analysis in RStudio

Our Task

  • Patients have been given a new treatment for arthritis
  • We have measurements of inflammation over a period of days for each patient
  • We want to produce a preliminary analysis and graphs for this data

Download the file from the following link to your data/ directory, and extract it

(the link is also available on the course Etherpad page)

Loading Data - Interactive Demo

  • You created data manually earlier, but this is rare
  • Data are most commonly read in from plain text files

Data files can be inspected in RStudio

data <- read.csv(file = "data/inflammation-01.csv", header = FALSE)
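
A self-contained sketch of the same call, for trying out read.csv() without the course file. It writes a tiny headerless CSV to a temporary file first. The values and the name example_data are made up.

```r
# Write a small CSV file with no header line (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c("0,1,3", "0,2,1", "1,1,2"), tmp)

# header = FALSE: the first line is data, not column names,
# so R names the columns V1, V2, ...
example_data <- read.csv(file = tmp, header = FALSE)
example_data
##   V1 V2 V3
## 1  0  1  3
## 2  0  2  1
## 3  1  1  2

dim(example_data)   # number of rows, number of columns
## [1] 3 3
```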

Challenge 02 (2min)

Someone gives you a data file that has:

  • a comma (,) as the decimal point character
  • semi-colon (;) as the field separator

How would you open it, using read.csv()?

Use the help function and documentation

USE CHALLENGE LINK ON ETHERPAD

Indexing Data

INTERACTIVE DEMO

  • We use indexing to refer to elements of a matrix
    • square brackets: []
    • row, then column: [row, column]
data[1, 1]     # First value in dataset
data[30, 20]   # Middle value of dataset
  • To get a range of values, use the : separator (meaning ‘to’)
data[1:4, 1:4]   # rows 1 to 4; columns 1 to 4
  • To select a complete row or column, leave it blank
data[5, ]     # row 5
data[, 16]    # column 16
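
The same indexing rules on a matrix small enough to check by eye. This is a sketch with made-up values; m is not part of the inflammation dataset.

```r
# A 3 x 4 matrix, filled column by column
m <- matrix(1:12, nrow = 3, ncol = 4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

m[2, 3]       # row 2, column 3
## [1] 8
m[1:2, 3:4]   # rows 1 to 2; columns 3 to 4
##      [,1] [,2]
## [1,]    7   10
## [2,]    8   11
m[3, ]        # all of row 3
## [1]  3  6  9 12
m[, 2]        # all of column 2
## [1] 4 5 6
```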

Summary Functions

INTERACTIVE DEMO

  • R provides useful functions to summarise data
  • We can use indexing to get summary information on individual patients and days
max(data)           # largest value in dataset
max(data[2, ])      # largest value for row (patient) 2
min(data[, 7])      # smallest value on column (day) 7
mean(data[, 7])     # mean value on day 7
sd(data[, 7])       # standard deviation of values on day 7
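
A sketch of the same summary functions on a small made-up matrix (3 patients in rows, 4 days in columns), so the results can be verified by hand. The name readings and its values are assumptions for illustration.

```r
# Made-up readings: each group of three values is one day (column)
readings <- matrix(c(0, 1, 2,
                     1, 3, 2,
                     4, 2, 6,
                     2, 2, 5), nrow = 3)

max(readings)          # largest value anywhere
## [1] 6
max(readings[2, ])     # largest value for patient 2
## [1] 3
min(readings[, 3])     # smallest value on day 3
## [1] 2
mean(readings[, 3])    # mean value on day 3
## [1] 4
sd(readings[, 3])      # sample standard deviation on day 3
## [1] 2
```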

Repetitive Calculations

INTERACTIVE DEMO

  • Calculating for every patient (or day) this way is tedious

Computers exist to do tedious things for us

  • R has several ways to automate this process

  • For example, apply a function (mean) to each row of the data:

apply(X = data, MARGIN = 1, FUN = mean)
  • MARGIN = 1: rows
  • MARGIN = 2: columns
rowMeans(data)
colMeans(data)
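
A sketch showing that apply() and the rowMeans()/colMeans() shortcuts agree, using a small made-up matrix (3 patients in rows, 4 days in columns). The name readings and its values are assumptions for illustration.

```r
readings <- matrix(c(0, 1, 2,
                     1, 3, 2,
                     4, 2, 6,
                     2, 2, 5), nrow = 3)

# MARGIN = 1: one result per row (patient)
apply(X = readings, MARGIN = 1, FUN = mean)
## [1] 1.75 2.00 3.75
rowMeans(readings)
## [1] 1.75 2.00 3.75

# MARGIN = 2: one result per column (day)
apply(X = readings, MARGIN = 2, FUN = mean)
## [1] 1 2 4 3
colMeans(readings)
## [1] 1 2 4 3

# apply() works with any summary function, not only mean
apply(X = readings, MARGIN = 2, FUN = max)
## [1] 2 3 6 5
```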

Base Graphics

“The purpose of computing is insight, not numbers.” - Richard Hamming

  • R has many available graphics packages
    • graphically beautiful
    • specific problem domains
  • ‘built-in’ graphics are known as base graphics
  • Base graphics are powerful tools for visualisation and understanding

Plotting

INTERACTIVE DEMO

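avg_inflammation_patient <- apply(data, 1, mean)   # mean inflammation per patient (row)
plot(avg_inflammation_patient)

max_day_inflammation <- apply(data, 2, max)        # maximum inflammation per day (column)
plot(max_day_inflammation)

plot(apply(data, 2, min))    # three functions nested in one line: plot(), apply(), min()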
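
A self-contained sketch of a labelled base graphics plot, using a small made-up matrix in place of the inflammation data. The names toy and avg_by_day, and the label text, are assumptions. xlab, ylab and main are standard arguments to plot().

```r
# Made-up data: 3 patients (rows) x 4 days (columns)
toy <- matrix(c(0, 1, 2,
                1, 3, 2,
                4, 2, 6,
                2, 2, 5), nrow = 3)

# One mean per day (column), then plot with axis labels and a title
avg_by_day <- apply(toy, 2, mean)
plot(avg_by_day,
     xlab = "Day",
     ylab = "Mean inflammation",
     main = "Average inflammation by day")
```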

Challenge 03 (5min)

Can you add plots to your script showing:

  • scatterplot of standard deviation of inflammation across all patients, by day
  • a histogram of average inflammation across all patients, by day

5. Data Types and Structures in R

Learning Objectives

  • Basic data types in R
  • Common data structures in R
  • How to find out the type/structure of R data
  • Understand how R’s data types and structures relate to your own data

Data Types and Structures in R

  • R is mostly used for data analysis
  • R has special types and structures to help you work with data
  • Much of the focus is on tabular data (data frames)

INTERACTIVE DEMO

Understanding data types, their uses, and how they relate to your own data is key to successful analysis with R

(it’s not just about programming)

What Data Types Do You Expect?

What data types would you expect to see?

What examples of data types can you think of from your own experience?

Please write them into the chat

Data Types in R

  • Data types in R are atomic
    • All data structures are built from these
  1. logical: TRUE, FALSE
  2. numeric:
    • integer: 3L, 2L, 123456L (the L suffix marks an integer; a bare 3 is stored as a double)
    • double (decimal): 3.0, -23.45, pi
  3. complex: 3+0i, 1+4i
  4. character (text): "a", 'SWC', "This is not a string"
  5. raw: binary data (we won’t cover this)
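
A short sketch of how to ask R which type a value has. typeof() reports the underlying storage type, while class() reports the more general class, which is why both integers and doubles count as numeric.

```r
typeof(TRUE)
## [1] "logical"
typeof(2L)
## [1] "integer"
typeof(3)          # no L suffix, so this is a double
## [1] "double"
typeof(1+4i)
## [1] "complex"
typeof("SWC")
## [1] "character"

class(3)           # doubles have class "numeric"
## [1] "numeric"
is.numeric(2L)     # integers also count as numeric
## [1] TRUE
```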

INTERACTIVE DEMO

Challenge 04 (2min)

Create examples of data with the following characteristics:

  • name: answer, type: logical
  • name: height, type: numeric
  • name: dog_name, type: character

For each variable, test that it has the data type you intended

Four Common R Data Structures

  • vector
  • factor
  • list
  • data.frame

INTERACTIVE DEMO

Challenge 05 (5min)

Vectors are atomic: they can contain only a single data type

What data type are the following vectors (xx, yy, zz)?

xx <- c(1.7, "a")
yy <- c(TRUE, 2)
zz <- c("a", TRUE)

Options: logical, integer, numeric, character

USE CHALLENGE LINK ON ETHERPAD

Coercion

  • Coercion means changing data from one type to another
  • R will perform implicit coercion on vectors to make them atomic

logical → integer → double → complex → character

If there are formatting problems with your data, you might not have the type you expect when you import into R

  • Manual coercion with as.<type_name>()
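
A sketch of both kinds of coercion, using made-up values. Note that as.numeric() gives NA, with a warning, for text it cannot interpret as a number; this is one way formatting problems in imported data show up.

```r
# Implicit coercion: c() promotes everything to the most general type
c(FALSE, 3L)     # logical -> integer
## [1] 0 3
c(3L, 2.5)       # integer -> double
## [1] 3.0 2.5
c(2.5, "b")      # double -> character
## [1] "2.5" "b"

# Manual coercion with as.<type_name>()
as.numeric(c("1", "2", "three"))   # "three" cannot be converted (warning)
## [1]  1  2 NA
as.logical(c(0, 1, 5))             # zero is FALSE, any other number is TRUE
## [1] FALSE  TRUE  TRUE
```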

INTERACTIVE DEMO

Factors

Data comes as one of two types:

  • quantitative: e.g. integers or real numbers
    (weight <- 17.2; rooms <- 7)
  • categorical: e.g. ordered or unordered classes
    (grade <- "8"; coat <- "brindled")

This kind of distinction is critical in many applications (e.g. statistical modelling)

  • Factors are special vectors that represent categorical data
    • Stored as vectors of labelled integers
    • Cannot be treated as strings/text
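
A sketch of how a factor is stored, using made-up coat colours. By default R sorts the levels alphabetically and numbers them in that order.

```r
coat <- factor(c("brindled", "black", "brindled", "tabby"))

levels(coat)        # the categories (labels), sorted alphabetically
## [1] "black"    "brindled" "tabby"
nlevels(coat)
## [1] 3
as.integer(coat)    # the integer codes stored underneath
## [1] 2 1 2 3

# To work with the labels as text, convert explicitly
as.character(coat)
## [1] "brindled" "black"    "brindled" "tabby"
```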

INTERACTIVE DEMO

Challenge 06 (5min)

Create a new factor, defining control and case experiments, and inspect the result:

f <- factor(c("case", "control", "case", "control", "case"))
str(f)
##  Factor w/ 2 levels "case","control": 1 2 1 2 1

In some statistical analyses in R it is important that the control level is numbered 1

  • Using the help available to you in RStudio, can you create a factor with the same values, but where the control level is numbered 1?

Lists

  • Lists are like vectors, but can hold any combination of data types
    • elements of a list are accessed with [[ ]] and can be named

INTERACTIVE DEMO

# create a list
l <- list(1, 'a', TRUE, matrix(0, nrow = 2, ncol = 2), f)
l_named <- list(a = "SWC", b = 1:4)
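
A sketch of the common ways to get elements back out of a list, reusing the l_named example. Double brackets and $ return the element itself, while single brackets return a smaller list.

```r
l_named <- list(a = "SWC", b = 1:4)

names(l_named)
## [1] "a" "b"
l_named[["a"]]       # element by name
## [1] "SWC"
l_named$b            # shorthand for l_named[["b"]]
## [1] 1 2 3 4
l_named[[2]][3]      # third value of the second element
## [1] 3
class(l_named["b"])  # single brackets keep the list wrapper
## [1] "list"
```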

Logical Indexing

  • We have used indexing, slicing and names to get data by ‘location’
> animal[c(2,4,6)]
[1] "o" "k" "y"
> l_named$b
[1] 1 2 3 4
  • Logical indexes select data that meets certain criteria

INTERACTIVE DEMO

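x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
mask <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
x[mask]      # keep the elements where mask is TRUE
## [1] 5.4 7.1 7.5
x[x > 7]     # the comparison builds the mask for us
## [1] 7.1 7.5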
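
A sketch of a few further things that can be done with a logical mask, using the same x. The thresholds are arbitrary choices for illustration.

```r
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)

x > 7                # a comparison returns one TRUE/FALSE per element
## [1] FALSE FALSE  TRUE FALSE  TRUE
x[x > 5 & x < 7]     # combine conditions with & (and) or | (or)
## [1] 5.4 6.2
which(x > 7)         # positions of the matching elements
## [1] 3 5
sum(x > 7)           # TRUE counts as 1, so this counts the matches
## [1] 2
```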